Introduction:

When a speech signal is obscured by a second, simultaneous competing speech signal, two types
of masking contribute to overall performance. Traditional “energetic” masking occurs when both
utterances contain energy in the same critical bands at the same time and portions of one or both of the
speech signals are rendered inaudible at the periphery. Higher-level “informational” masking occurs when
the signal and masker are both audible but the listener is unable to disentangle the elements of the target
signal from a similar-sounding distracter. Because informational masking is restricted to cases where
the masking signal is similar to the target signal, it has a much greater impact on performance when a
speech signal is masked by speech than when it is masked by noise. Furthermore, its
effects depend specifically on the characteristics of the target and masking speech signals. This brief
chapter outlines the results of some recent experiments conducted in our laboratory that
examined the role that informational masking plays in speech perception and attempted to isolate the
effects that informational and energetic masking have on multitalker listening.

Methods: 

All of the experiments described in this chapter were conducted using the Coordinate Response Measure
(CRM). In the CRM task, a listener hears one or more simultaneous phrases of the form “Ready, (call
sign), go to (color) (number) now” with one of eight call signs (“Baron,” “Charlie,” “Ringo,” “Eagle,”
“Arrow,” “Hopper,” “Tiger,” and “Laker”), one of four colors (red, blue, green, white), and one of eight
numbers (1-8). Researchers at the Air Force Research Laboratory have made a corpus of CRM speech
materials available to the public on CD-ROM (Bolia et al., 2000). This corpus contains all 256 possible
CRM phrases (8 call signs × 4 colors × 8 numbers) spoken by each of eight different talkers (four male,
four female). In the experiments described here, the stimulus always consisted of a target
phrase, randomly selected from the phrases in the corpus with the call sign “Baron,” and
one or more masking phrases, randomly selected from the phrases in the corpus with a different
call sign, color, and number than the target phrase. The listener’s task was to listen for the phrase
containing the pre-assigned target call sign “Baron” and respond with the color and number combination
contained in that phrase. The stimuli were presented over headphones at a comfortable listening level
(approximately 70 dB SPL), and the listener’s responses were collected either by using the computer mouse
to select the appropriately colored number from a matrix of colored numbers on the CRT or by pressing an
appropriately marked key on a standard computer keyboard.
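
The phrase-selection rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the software used in the experiments: the names `PHRASES` and `draw_trial` are hypothetical, talker selection (same-sex, different-sex, same-talker) is left out, and the requirement that maskers also differ from one another is an assumption that the text does not state.

```python
import itertools
import random

CALL_SIGNS = ["Baron", "Charlie", "Ringo", "Eagle",
              "Arrow", "Hopper", "Tiger", "Laker"]
COLORS = ["red", "blue", "green", "white"]
NUMBERS = list(range(1, 9))

# Every (call sign, color, number) phrase one talker can produce: 8 x 4 x 8 = 256.
PHRASES = list(itertools.product(CALL_SIGNS, COLORS, NUMBERS))


def draw_trial(n_maskers, rng, target_call_sign="Baron"):
    """Return (target, maskers) as (call sign, color, number) tuples.

    Each masker differs from the target in call sign, color, and number.
    Assumption (not stated in the text): maskers also differ from each other,
    which limits this sketch to at most three maskers (only four colors exist).
    """
    if not 0 <= n_maskers <= len(COLORS) - 1:
        raise ValueError("this sketch supports 0 to 3 maskers")
    target = rng.choice([p for p in PHRASES if p[0] == target_call_sign])
    used_calls, used_colors, used_numbers = {target[0]}, {target[1]}, {target[2]}
    maskers = []
    for _ in range(n_maskers):
        candidates = [p for p in PHRASES
                      if p[0] not in used_calls
                      and p[1] not in used_colors
                      and p[2] not in used_numbers]
        masker = rng.choice(candidates)
        maskers.append(masker)
        used_calls.add(masker[0])
        used_colors.add(masker[1])
        used_numbers.add(masker[2])
    return target, maskers


target, maskers = draw_trial(n_maskers=2, rng=random.Random(0))
```

A correct response in this sketch would be the pair `(target[1], target[2])`, that is, the color and number spoken with the call sign “Baron.”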

Factors that influence informational and energetic masking in speech perception:

Figure 1: Color and number identifications as a function of signal-to-noise ratio for five types of
masking signals: TM, envelope-modulated speech-shaped noise; TN, continuous speech-shaped
noise; TD, a different-sex masking phrase from the CRM corpus; TS, a same-sex masking phrase
from the CRM corpus; and TT, a masking phrase from the CRM corpus spoken by the same talker
used in the target phrase. Adapted from Brungart (2001).

Figure 1 shows performance in the CRM listening task with five different maskers: speech-spectrum-shaped
noise that has been amplitude modulated to match the intensity fluctuations that occur in normal
speech (TM); continuous speech-spectrum-shaped noise (TN); and a different-sex, same-sex, and
same-talker speech signal (TD, TS, and TT, respectively). The results shown in this figure highlight three
important characteristics of informational masking in speech perception:

1. The difference between speech-in-noise and speech-on-speech masking: The two noise conditions
shown in Figure 1 (TM and TN) differ fundamentally from the speech conditions in two important
ways. First, performance with the noise maskers tends to remain at a high level down to much lower SNRs
than performance with the speech maskers. Second, once the SNR does become low enough to degrade
performance with the noise maskers, performance degrades monotonically and precipitously as the SNR is
further reduced. In contrast, performance with the speech maskers (TD, TS, and TT) starts to degrade at
much higher SNRs but degrades more gradually, especially at negative SNR values.

2. The importance of voice characteristics: Performance in the CRM task is much better with a
different-sex interfering talker (TD) than with a same-sex interfering talker (TS), and much better with a
same-sex interfering talker than with a masking phrase spoken by the same talker used in the target phrase (TT).
Because informational masking depends on the relative similarity of the target and masking voices,
differences in voice characteristics can be a powerful cue for segregating the target and masking speech
signals.

3. The advantage of level differences: In contrast to performance with a noise masker, which degrades
monotonically as the SNR decreases, performance with a same-sex speech masker tends to plateau around
0 dB SNR. The reason for this plateau is that listeners are able to use differences in the
levels of the two talkers to distinguish the two competing voices and selectively attend to the quieter of the
two talkers in the stimulus. Thus, especially in the same-talker (TT) condition, listeners may do better at
negative SNR values because they can identify the target as the quieter talker in the stimulus. In contrast,
when the SNR is 0 dB in the TT condition, the prosodic and coarticulatory features that connect the call
sign to the color and number combination in the target phrase are the only cues that allow
listeners to discriminate between the color and number coordinates in the target and masking voices.
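
Because the level-difference cue depends on how the SNR is set, it may help to see one way a target and masker could be combined at a chosen SNR. The sketch below is an illustration only. It assumes that SNR is defined as the ratio of the overall RMS levels of the target and masker, which the text does not state, and the helper names `rms` and `mix_at_snr` and the toy sine-wave signals are hypothetical stand-ins for recorded phrases.

```python
import math


def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(target, masker, snr_db):
    """Scale the masker so that 20*log10(rms(target)/rms(masker)) equals snr_db.

    The target level is left unchanged. Returns (mixture, scaled_masker).
    Assumption: SNR is an overall RMS ratio, not a peak or per-band measure.
    """
    gain = rms(target) / (rms(masker) * 10 ** (snr_db / 20))
    scaled = [gain * m for m in masker]
    mixture = [t + m for t, m in zip(target, scaled)]
    return mixture, scaled


# Toy stand-ins for a target and a masking phrase (one second at 8 kHz).
fs = 8000
target = [math.sin(2 * math.pi * 200 * n / fs) for n in range(fs)]
masker = [0.3 * math.sin(2 * math.pi * 330 * n / fs) for n in range(fs)]

# At -6 dB SNR the target is the quieter of the two signals in the mixture.
mixture, scaled = mix_at_snr(target, masker, snr_db=-6.0)
achieved_snr_db = 20 * math.log10(rms(target) / rms(scaled))
```

At negative SNRs such as this one, the target is the quieter signal in the mixture, which is the cue described in item 3.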

Figure 2: Performance in a diotic CRM listening task with 0, 1, 2, or 3 interfering same-sex talkers.

Figure 3: Performance in a CRM listening task with 0, 1, 2, or 3 interfering same-sex talkers,
presented diotically or spatially separated by 45 degrees.

Figure 2 shows how performance in the CRM listening task changes as additional masking talkers are
added to the stimulus. When no competing talkers were present in the stimulus, performance was near
100%. The first competing talker reduced performance by approximately 40%, to 62% correct
responses. The second competing talker reduced performance by roughly another 40%, to 38% correct
responses, and the third reduced it by roughly another 40%, to 24% correct
responses. Thus CRM performance in a diotic multitalker speech display decreases by
approximately 40% of its previous value for each additional same-sex talker added to the stimulus.
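
The per-talker decline reported for Figure 2 can be checked with a short calculation. The sketch below uses only the percentages quoted in the text, taking “near 100%” as exactly 100 for the no-masker case, which is an assumption. The geometric model in `predicted_percent_correct` is a hypothetical illustration, not a fit reported by the authors.

```python
# Percent correct with 0, 1, 2, and 3 interfering same-sex talkers (from the text;
# the 0-talker value of 100 is an assumption standing in for "near 100%").
percent_correct = [100.0, 62.0, 38.0, 24.0]

# Fraction of performance retained each time one more talker is added.
retained = [after / before
            for before, after in zip(percent_correct, percent_correct[1:])]

# Each added talker leaves roughly 60% of the previous score, i.e. a ~40% drop.
mean_retained = sum(retained) / len(retained)
mean_drop = 1.0 - mean_retained


def predicted_percent_correct(n_maskers):
    """Geometric approximation: 100 * mean_retained ** n_maskers."""
    return 100.0 * mean_retained ** n_maskers


for n, observed in enumerate(percent_correct):
    print(n, observed, round(predicted_percent_correct(n), 1))
```

Read this way, the three drops are proportional rather than additive: each added talker removes about 40% of whatever performance remained, not 40 percentage points.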


In general, informational masking is reduced whenever the attributes of the competing talkers are made
more distinct in one or more perceptual dimensions. One very powerful way to distinguish the competing
talkers in a multitalker stimulus is to spatially separate their apparent locations.
Figure 3 shows performance in the CRM task with 1, 2, or 3 competing talkers both in the diotic condition,
where the talkers were presented from the same location, and in a spatial condition, where the talkers were
separated by 45 degrees in azimuth. In the case with one interfering talker, spatial separation
increased performance by approximately 25 percentage points. In the cases with two or three interfering
talkers, spatial separation nearly doubled the percentage of correct responses. These results clearly
illustrate the substantial decreases in informational masking that spatial separation in azimuth can produce
in multitalker listening.

Figure 4 shows a final example of purely informational masking in dichotic speech perception. In this
experiment, the normal two-talker same-sex (TS) CRM speech stimulus was presented to the right ear.
In this case, however, an additional speech or noise masker was presented to the left ear (as indicated in the
legend). The listeners were instructed to ignore the left ear and focus only on the right ear. The results
show that a speech signal in the left ear interfered substantially with performance even when it was
presented at a level 15 dB below the level of the target talker in the right ear, but that a noise signal in the
left ear did not interfere even when it was presented at a level 20 dB above that of the target speech signal.

In this case, the interference that occurred in the contralateral speech conditions was purely informational
and had no energetic component. Ongoing research in our laboratory is now attempting to find other ways
to isolate the informational and energetic components of speech-on-speech masking. Our hope is that this
will result in a more complete understanding of the informational masking that occurs in speech and, in the
long term, a significant improvement both in the audio displays that are used for multichannel speech
communications and in the ability of automatic speech processing systems to process multitalker speech
signals.

Figure 4: Performance in a dichotic CRM listening task with the target and one same-sex talker
presented in the right ear and a masking signal (indicated by the legend) presented in the left
ear.